
feat(cosmos-perf): record server-reported request duration as backend latency#4316

Open
tvaron3 wants to merge 9 commits into Azure:release/azure_data_cosmos-previews from tvaron3:feat/perf-backend-latency-v2

tvaron3 commented Apr 29, 2026

Summary

Enhances the Cosmos DB Rust perf runner with two measurement improvements:

1. Backend (Server-Reported) Latency

Parses the x-ms-request-duration-ms response header and tracks it alongside the existing client-observed wall-clock latency. This separates network transit time from server processing time in performance reports.

2. Cgroup CPU Utilization (cgroup_cpu_percent)

Adds a new metric that reads cgroupv2 cpu.stat and cpu.max to compute CPU utilization relative to the container's allocated quota. This matches what kubectl top pods reports and replaces the misleading system_cpu_percent (which reads /proc/stat and shows host-level CPU, appearing artificially low in containers).
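The quota-relative calculation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `read_cgroup_cpu_percent()`: the function names `parse_usage_usec`, `parse_cores`, and `cgroup_cpu_percent` are hypothetical, and the file contents are passed in as strings rather than read from the cgroup filesystem. It assumes the cgroupv2 formats where `cpu.stat` contains a `usage_usec <n>` line and `cpu.max` is `<quota> <period>` (or `max <period>` when no quota is set).

```rust
use std::time::Duration;

/// Extracts `usage_usec` from the text of a cgroupv2 `cpu.stat` file.
fn parse_usage_usec(cpu_stat: &str) -> Option<u64> {
    cpu_stat.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("usage_usec"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}

/// Converts the text of `cpu.max` into the number of allocated cores.
/// Returns None when there is no quota ("max") or the period is zero.
fn parse_cores(cpu_max: &str) -> Option<f64> {
    let mut parts = cpu_max.split_whitespace();
    let quota: u64 = parts.next()?.parse().ok()?; // "max" fails to parse
    let period: u64 = parts.next()?.parse().ok()?;
    if quota == 0 || period == 0 {
        return None;
    }
    Some(quota as f64 / period as f64)
}

/// CPU used between two `usage_usec` readings, as a percentage of the quota.
fn cgroup_cpu_percent(prev_usec: u64, cur_usec: u64, elapsed: Duration, cores: f64) -> Option<f64> {
    let elapsed_usec = elapsed.as_micros() as f64;
    if elapsed_usec <= 0.0 || cores <= 0.0 {
        return None; // avoid dividing by zero
    }
    let used_usec = cur_usec.saturating_sub(prev_usec) as f64;
    Some(used_usec / (elapsed_usec * cores) * 100.0)
}

fn main() {
    // Made-up readings taken one second apart in a container limited to 2 cores.
    let before = parse_usage_usec("usage_usec 1000000\nuser_usec 600000\n").unwrap();
    let after = parse_usage_usec("usage_usec 2560000\nuser_usec 1500000\n").unwrap();
    let cores = parse_cores("200000 100000\n").unwrap();
    let pct = cgroup_cpu_percent(before, after, Duration::from_secs(1), cores).unwrap();
    println!("{pct:.1}%"); // prints 78.0%
}
```

Returning `None` for an unlimited quota mirrors the fallback described for environments without a cgroup limit, such as local development.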

Changes

  • stats.rs: New read_cgroup_cpu_percent() function with delta-based measurement, division-by-zero guards, and safe u128→u64 clamping in histogram recording
  • runner.rs: Wire cgroup_cpu_percent: Option<f32> through PerfResult (serialized to Cosmos DB → ADX)
  • sdk/cosmos/.cspell.json: Add cgroup terminology to ignore list

Testing

  • Verified on AKS pods in cosmos-perf-rg — cgroup CPU reports ~78% matching kubectl top
  • Data flows through Cosmos DB change feed → ADX → Grafana dashboard

… latency

Reads x-ms-request-duration-ms response header on every Cosmos request
in the perf binary and emits backend_{min,max,mean,p50,p90,p99}_ms per
operation per reporting interval. Surfaces server-side processing time
separately from the client-observed wall-clock latency so network plus
client-queue overhead can be isolated downstream.

Implementation:
- New helper extract_backend_duration in operations/mod.rs parses the
  header value as milliseconds (f64) into a Duration.
- Operation::execute now returns Result<Option<Duration>> instead of
  Result<()>; each per-op implementation reads the header off the
  response (or sums across pages for QueryItems via into_pages()).
- Stats gains a parallel HdrHistogram for backend durations; samples
  are independent of client samples (intervals where 0 backend
  durations were observed surface as None on Summary, which serializes
  as null and is skipped via skip_serializing_if).
- PerfResult struct gains 6 Option<f64> backend_*_ms fields.
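As a rough illustration of the first two items, the sketch below parses a millisecond header value into a `Duration` and sums the values across query pages. The names `parse_backend_duration` and `sum_pages` are hypothetical stand-ins; the actual `extract_backend_duration` works on the SDK's response type, which is replaced here by a plain `Option<&str>` holding the raw header value.

```rust
use std::time::Duration;

/// Parses a raw `x-ms-request-duration-ms` value (milliseconds, possibly fractional).
/// Returns None when the header is absent, unparsable, negative, or not finite.
fn parse_backend_duration(header_value: Option<&str>) -> Option<Duration> {
    let ms: f64 = header_value?.trim().parse().ok()?;
    if !ms.is_finite() || ms < 0.0 {
        return None;
    }
    // Convert to whole nanoseconds; the float-to-integer cast saturates on overflow.
    Some(Duration::from_nanos((ms * 1_000_000.0).round() as u64))
}

/// Sums the backend duration over all pages of a query.
/// Returns None when no page carried a usable header.
fn sum_pages<'a>(pages: impl IntoIterator<Item = Option<&'a str>>) -> Option<Duration> {
    pages
        .into_iter()
        .filter_map(|value| parse_backend_duration(value))
        .reduce(|total, page| total + page)
}

fn main() {
    // Three pages; the second one is missing the header.
    let total = sum_pages([Some("1.5"), None, Some("0.25")]);
    println!("{total:?}");
}
```

Keeping the result as `Option<Duration>` lets a caller distinguish "no header seen" from a genuine zero-millisecond response.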

Existing fields, behaviour, and JSON keys are unchanged. Old payloads
without backend_* keys ingest cleanly into ADX (the schema mapping
treats missing keys as null).

Tests:
- backend_durations_aggregate_separately_from_client verifies the two
  histograms are independent.
- backend_summary_is_none_when_no_samples verifies the all-None path
  when the header is absent.
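The behaviour these two tests check can be pictured with a simplified model. The sketch below is an assumption-laden stand-in: it stores samples in a `Vec<Duration>` instead of the HdrHistogram the PR uses, reports only min, max, and mean, and the type names `Samples` and `OpStats` are hypothetical.

```rust
use std::time::Duration;

/// Simplified sample store (the PR uses an HdrHistogram instead).
#[derive(Default)]
struct Samples(Vec<Duration>);

impl Samples {
    fn record(&mut self, d: Duration) {
        self.0.push(d);
    }

    /// Returns (min, max, mean), or None when nothing was recorded.
    fn summary(&self) -> Option<(Duration, Duration, Duration)> {
        let min = *self.0.iter().min()?;
        let max = *self.0.iter().max()?;
        let mean = self.0.iter().sum::<Duration>() / self.0.len() as u32;
        Some((min, max, mean))
    }
}

/// Client and backend samples are kept in separate stores.
#[derive(Default)]
struct OpStats {
    client: Samples,
    backend: Samples,
}

impl OpStats {
    /// Always records the client latency; records backend latency only if present.
    fn record(&mut self, client: Duration, backend: Option<Duration>) {
        self.client.record(client);
        if let Some(b) = backend {
            self.backend.record(b);
        }
    }
}

fn main() {
    let mut stats = OpStats::default();
    stats.record(Duration::from_millis(10), Some(Duration::from_millis(4)));
    stats.record(Duration::from_millis(20), None); // header absent on this request
    println!("client:  {:?}", stats.client.summary());
    println!("backend: {:?}", stats.backend.summary());
}
```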

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 4 commits April 29, 2026 17:10
…ws' into feat/perf-backend-latency-v2

# Conflicts:
#	sdk/cosmos/azure_data_cosmos_perf/src/operations/create_item.rs
#	sdk/cosmos/azure_data_cosmos_perf/src/operations/upsert_item.rs
Renamed bmean -> backend_mean_dur, bmin -> back_min, bmax -> back_max
to avoid cspell 'Unknown word' errors in CI Analyze step.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting

Read cgroupv2 cpu.stat and cpu.max to compute pod-level CPU utilization
that matches what kubectl top reports. Falls back to None when not running
in a cgroup (e.g., local dev). Wire through PerfResult for ADX ingestion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 'cgroupv' and 'usec' to the allowed words list to fix CI
spell-check failures from the cgroup CPU metric addition.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 marked this pull request as ready for review April 30, 2026 17:08
Copilot AI review requested due to automatic review settings April 30, 2026 17:08
Move 'cgroupv' and 'usec' from .vscode/cspell.json to the local
sdk/cosmos/.cspell.json ignoreWords list. Reverts the root config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Adds backend (server-reported) latency measurement to the Cosmos perf runner by parsing x-ms-request-duration-ms, aggregating it alongside existing wall-clock latency, and emitting per-interval backend percentile/summary fields.

Changes:

  • Parse x-ms-request-duration-ms into an optional Duration and plumb it through Operation::execute.
  • Track backend-duration histograms separately from client wall-clock latency and emit backend summary stats (plus a “BackendP99” column in the console report).
  • Add cgroup CPU quota utilization metric reporting (cgroupv2) and update editor spellchecker word list.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/mod.rs | Adds extract_backend_duration() and changes Operation::execute to return Option<Duration>. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/create_item.rs | Returns backend duration extracted from response headers. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/read_item.rs | Returns backend duration extracted from response headers. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/upsert_item.rs | Returns backend duration extracted from response headers. |
| sdk/cosmos/azure_data_cosmos_perf/src/operations/query_items.rs | Iterates query by pages and sums backend duration across pages. |
| sdk/cosmos/azure_data_cosmos_perf/src/stats.rs | Adds backend histograms/summary fields and introduces cgroup CPU percent metric collection/printing. |
| sdk/cosmos/azure_data_cosmos_perf/src/runner.rs | Records backend durations into stats and serializes backend/cgroup metrics in result documents. |
| .vscode/cspell.json | Adds words related to the new cgroup metrics (and reformats the file). |

Comment thread .vscode/cspell.json Outdated
Comment thread .vscode/cspell.json Outdated
Comment thread sdk/cosmos/azure_data_cosmos_perf/src/stats.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_perf/src/stats.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_perf/src/stats.rs
Comment thread sdk/cosmos/azure_data_cosmos_perf/src/runner.rs
tvaron3 and others added 3 commits April 30, 2026 10:15
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Guard against u128→u64 truncation in histogram recording by
  clamping with .min(u64::MAX as u128) before cast
- Add division-by-zero guard for period_usec==0 and cores<=0.0
  in cgroup CPU calculation
- Add 'cgroupv2' to sdk/cosmos/.cspell.json ignore list
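The clamping in the first item can be sketched as below. The helper name `micros_clamped` is hypothetical, and the unit (microseconds) is an assumption for illustration; the point is that `Duration::as_micros()` returns a `u128`, so a bare `as u64` cast would silently wrap for very large values.

```rust
use std::time::Duration;

/// Converts a duration to microseconds, saturating at u64::MAX instead of wrapping.
fn micros_clamped(d: Duration) -> u64 {
    d.as_micros().min(u64::MAX as u128) as u64
}

fn main() {
    println!("{}", micros_clamped(Duration::from_millis(3)));
    // Duration::MAX holds far more microseconds than fit in a u64.
    println!("{}", micros_clamped(Duration::MAX) == u64::MAX);
}
```

An equivalent spelling is `u64::try_from(d.as_micros()).unwrap_or(u64::MAX)`, which makes the saturation explicit without a cast.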

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore original 2-space indent and sort order, keeping diff to just
the 3 added words (cgroupv, cgroupv2, usec).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

FabianMeiswinkel left a comment

LGTM

@github-project-automation github-project-automation Bot moved this from Todo to Approved in CosmosDB Go/Rust Crew Apr 30, 2026

Labels

Cosmos The azure_cosmos crate

Projects

Status: Approved

3 participants